SYSTEM FOR SYNCHRONIZATION
BETWEEN MOVING PICTURE AND A
TEXT-TO-SPEECH CONVERTER

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to a system for synchronization
between moving picture and a text-to-speech (TTS) converter,
and more particularly to a system that synchronizes moving
picture with synthesized speech by using the moving time of
the lips and the duration of speech information.

2. Description of the Related Art 

In general, a speech synthesizer provides a user with
various types of information in an audible form. For this
purpose, the speech synthesizer should provide a high quality
speech synthesis service from the given input text. In
addition, in order for the speech synthesizer to be
operatively coupled to a database constructed in a
multi-media environment, or to various media provided by a
counterpart involved in a conversation, the speech
synthesizer must be able to generate a synthesized speech
that is synchronized with these media. In particular,
synchronization between moving picture and the TTS is
essential to providing a user with a high quality service.

FIG. 1 shows a block diagram of a conventional
text-to-speech converter, which generally generates a
synthesized speech from the input text in three steps.

At step 1, a language processing unit 1 converts an input
text to a phoneme string, estimates prosodic information,
and symbolizes it. The symbols of the prosodic information
are estimated from the phrase boundary, clause boundary,
accent position, sentence pattern, etc. by analyzing the
syntactic structure. At step 2, a prosody processing unit 2
calculates the values of the prosody control parameters from
the symbolized prosodic information by using rules and
tables. The prosody control parameters include phoneme
duration and pause interval information. Finally, at step 3,
a signal processing unit 3 generates a synthesized speech by
using a synthesis unit DB 4 and the prosody control
parameters. That is, the conventional synthesizer must
estimate the prosodic information related to naturalness and
speaking rate only from the input text, in the language
processing unit 1 and the prosody processing unit 2.
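
As an illustration of step 2 only, the following minimal sketch
shows how phoneme durations might be derived from symbolized
prosodic information "by using rules and tables". The table
values, the phrase-final lengthening rule, and every name in the
code are assumptions made for this example; they are not taken
from the conventional converter of FIG. 1.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical duration table in milliseconds (illustrative values only).
BASE_DURATION_MS = {"h": 60.0, "e": 90.0, "l": 70.0, "o": 110.0}
DEFAULT_DURATION_MS = 80.0   # assumed fallback for phonemes not in the table
PHRASE_FINAL_FACTOR = 1.5    # assumed rule: lengthen phrase-final phonemes


@dataclass
class Phoneme:
    symbol: str
    # Symbolized prosodic information produced by step 1 (assumed form).
    phrase_final: bool = False


def phoneme_durations(phonemes: List[Phoneme]) -> List[float]:
    """Look up a base duration per phoneme, then apply the lengthening rule."""
    durations = []
    for p in phonemes:
        d = BASE_DURATION_MS.get(p.symbol, DEFAULT_DURATION_MS)
        if p.phrase_final:
            d *= PHRASE_FINAL_FACTOR
        durations.append(d)
    return durations
```

Note that nothing in this computation refers to a moving picture:
the durations depend on the text alone, which is the limitation
discussed in the remainder of this section.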

Presently, a great deal of research on TTS is being
conducted throughout the world for application to native
languages, and some countries have already started
commercial services. However, the conventional synthesizer
is aimed at synthesizing speech from an input text, and thus
there is no research activity on a synthesis method that can
be used in connection with multi-media. In addition, when
dubbing is performed on moving picture or animation by using
the conventional TTS method, the information required to
synchronize the media with the synthesized speech cannot be
estimated from the text alone. Thus, it is not possible to
generate, from text information only, a synthesized speech
that is smoothly and operatively coupled to moving pictures.

If the synchronization between moving picture and a
synthesized speech is regarded as a kind of dubbing, there
can be three implementation methods. The first method
synchronizes moving picture with a synthesized speech on a
sentence basis. This method regulates the time duration of
the synthesized speech by using information on the start
point and end point of the sentence. It has the advantage
that it is easy to implement and that the additional effort
is minimal. However, smooth synchronization cannot be
achieved with this method. As an alternative, there is a
method wherein the start point, end point, and phoneme
symbol of every phoneme are transcribed in the interval of
the moving picture related to a speech signal, and this
information is used in generating a synthesized speech.
Since the synchronization of moving picture with a
synthesized speech can be achieved for each phoneme with
this method, the accuracy can be enhanced. However, this
method has the disadvantage that additional effort is
required to detect and record time duration information for
every phoneme in a speech interval of the moving picture.

As another alternative, there is a method wherein
synchronization information is recorded based on patterns by
which a lip motion can be easily distinguished, such as the
start and end points of the speech, the opening and closing
of the lips, protrusion of the lips, etc. This method can
enhance the efficiency of synchronization while minimizing
the additional effort required to produce the
synchronization information.
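
Of the three methods above, the sentence-based one is the
simplest to sketch. The example below assumes that "regulating
the time duration" means scaling every phoneme duration by one
common factor so that the synthesized sentence exactly fills the
interval between the recorded start and end points. The uniform
scaling and the function name are assumptions made for
illustration.

```python
from typing import List


def fit_to_sentence(durations_ms: List[float],
                    start_ms: float,
                    end_ms: float) -> List[float]:
    """Scale phoneme durations uniformly to fill [start_ms, end_ms]."""
    target = end_ms - start_ms
    total = sum(durations_ms)
    if target <= 0:
        raise ValueError("end point must come after start point")
    if total <= 0:
        raise ValueError("phoneme durations must sum to a positive value")
    scale = target / total
    return [d * scale for d in durations_ms]
```

Because one factor is applied to the whole sentence, individual
phonemes can still drift against the lip motion inside the
sentence, which is why smooth synchronization is not achieved
with this method.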

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide
a method of formatting and normalizing continuous lip
motions into events in a moving picture, in addition to a
text, in a text-to-speech converter.

It is another object of the invention to provide a system for
synchronization between moving picture and a synthesized
speech by defining an interface between event information
and the TTS and using it in generating the synthesized
speech.

In accordance with one aspect of the present invention, a
system for synchronization between moving picture and a
text-to-speech converter is provided which comprises
distributing means for receiving multi-media input
information, transforming it into the respective data
structures, and distributing it to each medium; image output
means for receiving image information of the multi-media
information from said distributing means; language
processing means for receiving language texts of the
multi-media information from said distributing means,
transforming the text into a phoneme string, and estimating
and symbolizing prosodic information; prosody processing
means for receiving the processing result from said language
processing means and calculating the values of prosodic
control parameters; synchronization adjusting means for
receiving the processing results from said prosody
processing means, adjusting the time durations of every
phoneme for synchronization with image signals by using
synchronization information of the multi-media information
from said distributing means, and inserting the adjusted
time durations into the results of said prosody processing
means; signal processing means for receiving the processing
results from said synchronization adjusting means to
generate a synthesized speech; and a synthesis unit database
block for selecting the units required for synthesis in
accordance with a request from said signal processing means,
and transferring the required data.
BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will become more apparent upon a
detailed description of the preferred embodiments for
carrying out the invention as rendered below. In the
description to follow, references will be made to the
accompanying drawings, where like reference numerals are
used to identify like or similar elements in the various
drawings and in which:

FIG. 1 shows a block diagram of a conventional
text-to-speech converter;

FIG. 2 shows a block diagram of a synchronization system in
accordance with the present invention;

FIG. 3 shows a detailed block diagram to illustrate a method
of synchronizing a text-to-speech converter; and

FIG. 4 shows a flow chart to illustrate a method of
synchronizing a text-to-speech converter.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 2 shows a block diagram of a synchronization system in
accordance with the present invention. In FIG. 2, reference
numerals 5, 6, 7, 8 and 9 indicate a multi-data input unit,
a central processing unit, a synthesized database, a
digital/analog (D/A) converter, and an image output unit,
respectively.

Data comprising multi-media such as an image, text, etc. is
inputted to the multi-data input unit 5, which outputs the
input data to the central processing unit 6. The algorithm
in accordance with the present invention is embedded in the
central processing unit 6. The synthesized database 7, a
synthesis DB for use in the synthesis algorithm, is stored
in a storage device and transmits necessary data to the
central processing unit 6. The digital/analog converter 8
converts the synthesized digital data into an analog signal
and outputs it to the exterior. The image output unit 9
displays the input image information on the screen.

Table 1 as shown below illustrates one example of structured
multi-media input information to be used in connection with
the present invention. The structured information includes a
text, moving picture, lip shape, information on positions in
the moving picture, and information on the time duration.
The lip shape can be transformed into numerical values based
on a degree of a down motion of a lower lip, up and down
motion at the left edge of an upper lip, up and down motion
at the right edge of an upper lip, up and down motion at the
left edge of a lower lip, up and down motion at the right
edge of a lower lip, up and down motion at the center
portion of an upper lip, up and down motion at the center
portion of a lower lip, degree of protrusion of an upper
lip, degree of protrusion of a lower lip, distance from the
center of a lip to the right edge of a lip, and distance
from the center of a lip to the left edge of a lip. The lip
shape can also be defined in a quantified and normalized
pattern in accordance with the position and manner of
articulation for each phoneme. The information on positions
is defined by the position of a scene in a moving picture,
and the time duration is defined by the number of the scenes
in which the same lip shape is maintained.

TABLE 1

Example of Synchronization Information

Input Information   Parameter       Parameter Value
-----------------   -------------   --------------------------------
text                sentence
moving picture      scene
synchronization     lip shape       degree of a down motion of a
information                         lower lip, up and down motion at
                                    the left edge of an upper lip,
                                    up and down motion at the right
                                    edge of an upper lip, up and
                                    down motion at the left edge of
                                    a lower lip, up and down motion
                                    at the right edge of a lower
                                    lip, up and down motion at the
                                    center portion of an upper lip,
                                    up and down motion at the center
                                    portion of a lower lip, degree
                                    of protrusion of an upper lip,
                                    degree of protrusion of a lower
                                    lip, distance from the center of
                                    a lip to the right edge of a
                                    lip, and distance from the
                                    center of a lip to the left edge
                                    of a lip
                    information     position of scene in moving
                    on position     picture
                    time duration   number of continuous scenes

FIG. 3 shows a detailed block diagram to illustrate a method
of synchronizing a text-to-speech converter and FIG. 4 shows
a flow chart to illustrate a method of synchronizing a
text-to-speech converter. In FIG. 3, reference numerals 10,
11, 12, 13, 14, 15, 16, and 17 indicate a multi-media
information input unit, a multi-media distributor, a
standardized language processing unit, a prosody processing
unit, a synchronization adjusting unit, a signal processing
unit, a synthesis unit database, and an image output unit,
respectively.

The multi-media information in the multi-media information
input unit 10 is structured in a format as shown above in
Table 1, and comprises a text, moving picture, lip shape,
information on positions in the moving picture, and
information on time durations. The multi-media distributor
11 receives the multi-media information from the multi-media
information input unit 10, and transfers images and texts of
the multi-media information to the image output unit 17 and
the language processing unit 12, respectively. When the
synchronization information is transferred, it is converted
into a data structure which can be used in the
synchronization adjusting unit 14.

The language processing unit 12 converts the texts received
from the multi-media distributor 11 into a phoneme string,
and estimates and symbolizes prosodic information to
transfer it to the prosody processing unit 13. The symbols
for the prosodic information are estimated from the phrase
boundary, clause boundary, accent position, sentence
pattern, etc. by using the results of analysis of syntax
structures.

The prosody processing unit 13 receives the processing
results from the language processing unit 12, and calculates
the values of the prosodic control parameters. The prosodic
control parameters include the time duration of phonemes,
contour of pitch, contour of energy, and position and length
of pauses. The calculated results are transferred to the
synchronization adjusting unit 14.
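
The synchronization information of Table 1 could be held in a
data structure along the following lines. This is a minimal
sketch: the class names, the field names, and the use of a fixed
scene rate to convert scene counts into milliseconds are
assumptions made for illustration, since the description defines
position and time duration only in terms of scenes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LipShape:
    """Normalized lip-shape values, one field per Table 1 parameter."""
    lower_lip_down: float
    upper_lip_left_edge: float
    upper_lip_right_edge: float
    lower_lip_left_edge: float
    lower_lip_right_edge: float
    upper_lip_center: float
    lower_lip_center: float
    upper_lip_protrusion: float
    lower_lip_protrusion: float
    center_to_right_edge: float
    center_to_left_edge: float


@dataclass
class SyncEvent:
    lip_shape: LipShape
    position: int   # position of the scene in the moving picture
    duration: int   # number of continuous scenes with the same lip shape


def event_interval_ms(event: SyncEvent,
                      scenes_per_second: float = 30.0) -> Tuple[float, float]:
    """Return (start, length) in milliseconds for one event.

    scenes_per_second is an assumed playback rate, not a value from Table 1.
    """
    if scenes_per_second <= 0:
        raise ValueError("scenes_per_second must be positive")
    start = event.position * 1000.0 / scenes_per_second
    length = event.duration * 1000.0 / scenes_per_second
    return start, length
```

For example, under the assumed rate of 30 scenes per second, an
event at scene position 15 that holds the same lip shape for 6
scenes would start at 500 ms and last 200 ms.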



